Abstract: We propose a general methodology for performing
statistical inference within a 'rare-events regime' that was
recently suggested by Wagner, Viswanath and Kulkarni. Our 
approach allows one to easily establish consistent estimators 
for a very large class of canonical estimation problems, in a 
large alphabet setting. These include the problems studied in 
the original paper, such as entropy and probability estimation, 
in addition to many other interesting ones. We particularly 
illustrate this approach by consistently estimating the size of 
the alphabet and the range of the probabilities. We start by 
proposing an abstract methodology based on constructing a 
probability measure with the desired asymptotic properties. We 
then demonstrate two concrete constructions by casting the Good- 
Turing estimator as a pseudo-empirical measure, and by using 
the theory of mixture model estimation. 

I. Introduction 

We propose a general methodology for performing statistical 
inference within the 'rare-events regime' suggested by Wagner, 
Viswanath and Kulkarni in [1], referred to as WVK hereafter.
This regime is a scaling statistical model that strives to capture 
large alphabet settings, and is characterized by the following 
notion of a rare-events source. 

Definition 1. Let $\{(A_n, p_n)\}_{n \in \mathbb{N}}$ be a sequence of pairs
where each $A_n$ is an alphabet of finitely many symbols, and $p_n$ is
a probability mass function over $A_n$. Let $X_n$ be a single
sample from $p_n$, and use it to define a 'shadow' sequence
$Z_n = n p_n(X_n)$. Let $P_n$ denote the distribution of $Z_n$. We
call $\{(A_n, p_n)\}_{n \in \mathbb{N}}$ a rare-events source if the following
conditions hold.

(i) There exists an interval $C = [\check{c}, \hat{c}]$, $0 < \check{c} \le \hat{c} < \infty$,
such that for all $n \in \mathbb{N}$ we have $\check{c}/n \le p_n(a) \le \hat{c}/n$ for all
$a \in A_n$, or equivalently, $P_n$ is supported on $C$.

(ii) There exists a random variable $Z$ such that $Z_n \to Z$ in
distribution. Equivalently, there exists a distribution $P$
such that $P_n \Rightarrow P$ weakly.
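
As an illustration of Definition 1, the following minimal sketch computes the shadow distribution $P_n$ of a finite source. The two-level source and the function names `shadow_distribution` and `two_level_source` are hypothetical choices made for this example only; they do not appear in WVK.

```python
from collections import defaultdict


def shadow_distribution(n, pmf):
    """Law of Z_n = n * p_n(X_n), returned as {atom: probability}.

    A symbol a is drawn with probability p_n(a) and contributes the
    atom n * p_n(a), so atoms are weighted by the symbol probabilities.
    """
    law = defaultdict(float)
    for p in pmf:
        # Rounding merges atoms that differ only by floating-point noise.
        law[round(n * p, 12)] += p
    return dict(law)


def two_level_source(n):
    """Hypothetical source: n/2 symbols of mass 1.5/n and n/2 of mass 0.5/n."""
    half = n // 2
    return [1.5 / n] * half + [0.5 / n] * half


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, shadow_distribution(n, two_level_source(n)))
```

For every even $n$ this source has the same shadow distribution, with atoms 0.5 and 1.5, so conditions (i) and (ii) hold trivially with $C = [0.5, 1.5]$.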

To complete the model, we adopt the following sampling
scheme. For each $n$, we draw $n$ independent samples from
$p_n$, and we denote them by $X_{n,1}, \dots, X_{n,n}$. Using these
samples, we are interested in estimating various quantities.
WVK consider, among a few others, the following:

• The total (Good-Turing) probabilities of all symbols
appearing exactly $k$ times, for each $k \in \mathbb{N}_0$.

• The normalized log-probability of the observed sequence. 



• The normalized entropy of the source. 

• The relative entropy between the true and empirical 
distributions. 

They also consider two-sequence problems and hypothesis 
testing, but we focus here on single sequence estimation. 

It is striking that many of these quantities can be estimated 
in such a harsh scaling model, where one cannot hope for 
the empirical distribution to converge in any traditional sense. 
However, WVK's estimators have some drawbacks. For exam- 
ple, since they are based on series expansions of the quantities 
to be estimated, one has to carefully choose the growth rate of 
partial sums, in order to control convergence properties. More 
importantly, they are specifically tailored to each individual 
task. Their consistency is established on a case-by-case basis. 
What is desirable, and what this paper contributes to, is a 
methodology for performing more general statistical inference 
within this regime. Ideally such a framework would allow one 
to tackle a very large class of canonical estimation problems, 
and establish consistency more easily. 

We may summarize the fundamental ideas behind our 
approach and the organization of this paper as follows. First, 
in Section II we isolate the class of estimation problems that
we are interested in as those that asymptotically converge to 
an integral against P. The quantities studied by WVK fall in 
this category, and so do other interesting problems such as 
estimating the size of the alphabet. Other problems, such as 
estimating the range of the probabilities given by the support 
interval C, can also be studied in this framework. 

Next, in Section III we propose an abstract solution
methodology. At its core, we construct a (random) distribution
$\hat{P}_n$ that converges weakly to $P$ for almost every
observation sample. This construction immediately establishes
the consistency of natural estimators for the abovementioned
quantities, if bounds on $C$ are known. If in addition the rate
of the convergence of $\hat{P}_n$ is established, the framework gives
consistent estimators even without bounds on $C$.

To make this methodology concrete, we build on a core
result of WVK that establishes the strong consistency of the
Good-Turing estimator. In particular, since the role of the
empirical measure is lost, we show in Section IV that we
can treat the Good-Turing estimator as a pseudo-empirical
measure. Once this is established, we can borrow heavily from
the theory of mixture models, where inference is done using
i.i.d. samples, and adapt it to our framework. In Section V
we suggest two approaches for constructing $\hat{P}_n$: one that is
based on maximum likelihood, and another that is based on
minimum distance. Both constructions guarantee the almost
sure weak convergence of $\hat{P}_n$ to $P$, but the latter, under some
conditions, also provides the desirable convergence rates.

In Section VI we illustrate the methodology with some
examples. In particular, we show how one can consistently
estimate the entropy of the source and the probability of the
sequence as studied by WVK, but we also propose consistent
estimators for the size of the alphabet and for the support
interval $C$.

Notation: Throughout, we use F(.; .) to denote the cumula- 
tive distribution of the second argument (which is a probability 
measure on the real line or on the integers) evaluated at the 
first argument (which is a point on the real line or an integer). 

II. A General Class of Estimation Problems 
A. Definitions 

Given i.i.d. samples $X_{n,1}, \dots, X_{n,n}$ from the rare-events
source $(A_n, p_n)$, we can pose a host of different estimation
problems. Since the alphabet is changing, quantities that depend
on explicit symbol labels are not meaningful. Therefore,
one ought to only consider estimands that are invariant under
re-labeling of the symbols in $A_n$. In particular, we consider
the following class of general estimation problems.

Definition 2. Consider the problem of estimating a sequence
$\{Y_n\}_{n \in \mathbb{N}}$ of real-valued random variables using, for every
$n$, the samples $X_{n,1}, \dots, X_{n,n}$. We call this a canonical
estimation problem if, for every rare-events source, we have:

$$E[Y_n] = \int_C f_n(x) \, dP_n(x), \tag{1}$$

for some sequence $\{f_n\}$ of continuous real-valued functions
on $\mathbb{R}^+$ that converge pointwise to a continuous function $f$.

It is worth noting that it follows that $\{f_n\}$ and $f$ are also
bounded on every closed interval $[a, b]$, $0 < a < b < \infty$. Observe
that this definition indeed corresponds to estimands that
are invariant under re-labeling, in expectation. The following
lemma characterizes the limit.

Lemma 1. For any canonical estimation problem,

$$E[Y_n] \to \int_C f(x) \, dP(x). \tag{2}$$

Proof: Since $P_n \Rightarrow P$, we can apply Skorokhod's
theorem ([2], p. 333) to construct a convergent sequence of
random variables $\xi_n \xrightarrow{a.s.} \xi$, where $\xi_n \sim P_n$ and $\xi \sim P$. By
continuity, it follows that $f_n(\xi_n) \xrightarrow{a.s.} f(\xi)$. By the bounded
convergence theorem, we then have $E[f_n(\xi_n)] \to E[f(\xi)]$.
Since $E[Y_n] = E[f_n(\xi_n)]$ and $\int_C f(x) \, dP(x) = E[f(\xi)]$, the
lemma follows. ■
It is often more interesting to consider the subclass of 
canonical problems where there is strong concentration around 
the mean, and where the Borel-Cantelli lemma applies to give 
almost sure convergence to the mean. 



Definition 3. If a canonical estimation problem further satisfies
$|Y_n - E[Y_n]| \xrightarrow{a.s.} 0$, then we call it a strong canonical
problem. It follows that for strong canonical problems,

$$Y_n \xrightarrow{a.s.} \int_C f(x) \, dP(x). \tag{3}$$

Using these definitions, a reasonable estimator will at least
agree with the limit set forth in Lemma 1. Other modes of
convergence may be reasonable, but we would like to exhibit
a statistic that almost surely converges to that limit. We make
this precise in the following definition.

Definition 4. Given a canonical problem as in Definition 2, a
corresponding estimator is a sequence $\{\hat{Y}_n\}_{n \in \mathbb{N}}$ such that, for
each $n$, $\hat{Y}_n(a_1, \dots, a_n)$ is a real-valued function on $(A_n)^n$,
to be evaluated on the sample sequence $X_{n,1}, \dots, X_{n,n}$. A
consistent estimator is one that obeys

$$\hat{Y}_n(X_{n,1}, \dots, X_{n,n}) \xrightarrow{a.s.} \int_C f(x) \, dP(x). \tag{4}$$

For canonical estimation problems that are not necessarily 
strong, this approach produces an asymptotically unbiased 
estimator, with asymptotic mean squared error that is no 
more than the asymptotic variance of the estimand itself. 
For strong canonical estimation problems, this approach es- 
tablishes strong consistency, in the sense that the estimator 
converges to the estimand, almost surely. 

B. Examples 

To motivate the setting we have just described, we first note
that all of the quantities studied by WVK are strong canonical
estimation problems. For each quantity, WVK propose an estimator,
and individually establish its consistency by showing
almost sure convergence to the limit in Lemma 1. In contrast,
what we emphasize here is that this can potentially be done
universally over all strong canonical problems.

To highlight the usefulness of this generalization, we illustrate
two important quantities that fall within this framework.
We will revisit these in more detail in Section VI. The first
quantity is the normalized size of the alphabet: $|A_n|/n$. For
this, one can show (see, for example, [3]) that $|A_n|/n =
\int_C \frac{1}{x} \, dP_n(x)$. Therefore we can take $f_n(x) = f(x) = 1/x$, and
since the estimand is deterministic, we have a strong canonical
estimation problem.
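
The identity behind the alphabet-size problem follows from $E[1/Z_n] = \sum_{a \in A_n} p_n(a) \cdot \frac{1}{n p_n(a)} = |A_n|/n$. The short sketch below checks it numerically on a hypothetical probability mass function; the function name and the example source are assumptions made for illustration.

```python
def normalized_alphabet_size_via_shadow(n, pmf):
    """Integral of 1/x against P_n, written as a sum over symbols.

    Each symbol a contributes p_n(a) * 1 / (n * p_n(a)) = 1 / n,
    so the sum equals |A_n| / n.
    """
    return sum(p * (1.0 / (n * p)) for p in pmf)


if __name__ == "__main__":
    n = 10
    # Hypothetical source with 15 symbols: five of mass 0.1, ten of mass 0.05.
    pmf = [0.1] * 5 + [0.05] * 10
    print(normalized_alphabet_size_via_shadow(n, pmf), len(pmf) / n)
```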

The second quantity of interest is the interval $C$, or equivalently
its endpoints $\check{c}$ and $\hat{c}$. Note that, by construction, $P$ is
supported on $C$. Without loss of generality, we may assume
that $\check{c}$ and $\hat{c}$ are respectively the essential infimum and essential
supremum of $Z \sim P$. Therefore, note that $\left(\int_C x^{\pm q} \, dP(x)\right)^{\pm 1/q}$
converges to the essential infimum ($-$) or supremum ($+$)
as $q \to \infty$. We can therefore consider, for fixed $q > 1$,
the strong canonical problems that ensue from the choices
$f_n(x) = f(x) = x^{-q}$ and $f_n(x) = f(x) = x^{q}$. These, by
themselves, are not sufficient to provide estimates for $\check{c}$ and
$\hat{c}$. However if, in addition to consistency, we establish the
convergence rates of their estimators, then we can apply our
framework to estimate $C$, as we show in Section VI.
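
To see how the moments of order $\pm q$ approach the endpoints of the support, the following sketch evaluates $\left(\int x^{\pm q} \, dP(x)\right)^{\pm 1/q}$ for a discrete $P$. The two-atom distribution and the name `power_mean` are illustrative assumptions, not part of the formal development.

```python
def power_mean(atoms, weights, q):
    """(integral of x**q dP)**(1/q) for a discrete P; q may be negative."""
    moment = sum(w * x ** q for x, w in zip(atoms, weights))
    return moment ** (1.0 / q)


if __name__ == "__main__":
    # Hypothetical P with atoms 0.5 and 1.5, so C = [0.5, 1.5].
    atoms, weights = [0.5, 1.5], [0.25, 0.75]
    for q in (1, 10, 50, 200):
        lower = power_mean(atoms, weights, -q)  # tends to the essential infimum
        upper = power_mean(atoms, weights, q)   # tends to the essential supremum
        print(q, lower, upper)
```

The approach to the endpoints is slow in $q$, which is one way to see why a fixed $q$ is not sufficient and why a convergence rate is needed to let $q$ grow with $n$.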



III. Solution Methodology 

Our task now is to exhibit consistent estimators for canonical
problems. We present here our abstract methodology, which
we demonstrate concretely in Section V. The core of our
approach consists of using the samples $X_{n,1}, \dots, X_{n,n}$ to
construct a random measure $\hat{P}_n$ over $\mathbb{R}^+$, such that for
almost every sample sequence, the sequence of measures $\{\hat{P}_n\}$
converges weakly to $P$. We write:

$$\hat{P}_n \Rightarrow_{a.s.} P \quad \text{as } n \to \infty. \tag{5}$$

If we accomplish this, we can immediately suggest a
consistent estimator under certain conditions, as expressed
by Lemma 2. We will be interested in integrating functions
against the measure $\hat{P}_n$. However, since the support $C$ of $P$ is
unknown, we first introduce the notion of a tapered function
as a convenient way to control the region of integration. Given
a real-valued function $g(x)$ on $\mathbb{R}^+$, for every $D > 1$ define its
$D$-tapered version as:

$$g_D(x) = \begin{cases} g(D^{-1}) & x < D^{-1}, \\ g(x) & x \in [D^{-1}, D], \\ g(D) & x > D. \end{cases}$$

If $g$ is continuous on $(0, +\infty)$, then we can think of $g_D(x)$
as a bounded continuous extension of the restriction of $g$ on
$[D^{-1}, D]$ to all of $\mathbb{R}^+$.
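
The tapering operation amounts to clamping the argument to $[D^{-1}, D]$ before applying $g$. A minimal sketch, with the helper name `taper` chosen here for illustration:

```python
import math


def taper(g, D):
    """Return the D-tapered version of g: g evaluated at x clamped to [1/D, D]."""
    if D < 1:
        raise ValueError("D must be at least 1")

    def g_D(x):
        return g(min(max(x, 1.0 / D), D))

    return g_D


if __name__ == "__main__":
    log_4 = taper(math.log, 4.0)
    # Below 1/4 the value is frozen at log(1/4); above 4 it is frozen at log(4).
    print(log_4(0.01), log_4(2.0), log_4(100.0))
```

For $g = \log$, which is unbounded near 0 and at infinity, the tapered version is bounded by $\log D$ in absolute value.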

Lemma 2. Consider a canonical problem characterized by
some $f$. Let the support $C$ of a rare-events source be known
up to an interval $[D^{-1}, D] \supset C$ for some $D > 1$. Then, if
$\hat{P}_n \Rightarrow_{a.s.} P$ as $n \to \infty$, we have that

$$\hat{Y}_n = \int_{\mathbb{R}^+} f_D(x) \, d\hat{P}_n(x) \tag{6}$$

is a consistent estimator.

Furthermore, if $f$ is bounded everywhere, we can make the
uninformative choice $D = \infty$.

Proof: Since the tapered function $f_D$ is continuous and
bounded on $\mathbb{R}^+$, the almost sure weak convergence of $\hat{P}_n$
to $P$ implies that $\int_{\mathbb{R}^+} f_D \, d\hat{P}_n \xrightarrow{a.s.} \int_{\mathbb{R}^+} f_D \, dP$. But since
$P$ is supported on $C$ and $f_D$ agrees with $f$ on $C$, we have
$\int_{\mathbb{R}^+} f_D \, dP = \int_C f_D \, dP = \int_C f \, dP$. ■

In general, however, we will be interested in problems where
we do not have a priori knowledge about the endpoints of
$C$, and where an uninformative choice cannot be made because
$f$ is not bounded on $\mathbb{R}^+$, such as $f(x) = \log x$, $1/x$, or $x^q$. For
these problems, we can apply our methodology of integrating
against $\hat{P}_n$ by first establishing a rate for the convergence of
equation (5). We characterize such a rate using a sequence
$K_n \to \infty$, such that:

$$K_n \, d_W(\hat{P}_n, P) \xrightarrow{a.s.} 0, \tag{7}$$

where $d_W$ denotes the Wasserstein distance, which can be
expressed in its dual forms:

$$d_W(\hat{P}_n, P) = \int_{\mathbb{R}^+} \left| F(x; \hat{P}_n) - F(x; P) \right| dx = \sup_{h \in \mathrm{Lipschitz}(1)} \left| \int_{\mathbb{R}^+} h \, d\hat{P}_n - \int_{\mathbb{R}^+} h \, dP \right|. \tag{8}$$

In the remainder of the paper we will particularly focus on
$K_n$ of the form $n^s$ for some $s > 0$.
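
For discrete measures, the first form in (8) reduces to a finite sum, since the difference of the two c.d.f.s is piecewise constant between consecutive atoms. The sketch below computes the distance this way; the function name `wasserstein_1` and the example measures are assumptions made for illustration.

```python
def wasserstein_1(atoms_p, weights_p, atoms_q, weights_q):
    """Integral of |F(x; P) - F(x; Q)| dx for two discrete measures on the line."""

    def cdf(atoms, weights, x):
        return sum(w for a, w in zip(atoms, weights) if a <= x)

    points = sorted(set(atoms_p) | set(atoms_q))
    total = 0.0
    # Both c.d.f.s are constant on [left, right), so the integrand is too.
    for left, right in zip(points, points[1:]):
        gap = abs(cdf(atoms_p, weights_p, left) - cdf(atoms_q, weights_q, left))
        total += gap * (right - left)
    return total


if __name__ == "__main__":
    # Two point masses at 1 and 3 are at distance 2.
    print(wasserstein_1([1.0], [1.0], [3.0], [1.0]))
    # A two-atom measure against a point mass at 1.
    print(wasserstein_1([0.5, 1.5], [0.25, 0.75], [1.0], [1.0]))
```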

In the following lemma, we describe how we can use
convergence rates such as (7) to construct consistent estimators
that work with no prior knowledge of $C$, for a large subclass
of canonical problems.

Lemma 3. Consider a canonical problem characterized by
some $f$, which is Lipschitz on every closed interval $[a, b]$,
$0 < a < b < \infty$. If $K_n \, d_W(\hat{P}_n, P) \xrightarrow{a.s.} 0$ as $n \to \infty$, for
some $K_n \to \infty$, then we can choose $D_n \to \infty$ such that

$$\hat{Y}_n = \int_{\mathbb{R}^+} f_{D_n}(x) \, d\hat{P}_n(x) \tag{9}$$

is a consistent estimator. The growth of $D_n$ controls the growth
of the Lipschitz constant of $f_{D_n}$, which should be balanced
with the convergence rate $K_n$. More precisely, $\hat{Y}_n$ in (9) is
consistent for any $D_n \to \infty$ that additionally satisfies

$$\liminf_{n \to \infty} \frac{K_n}{\mathrm{Lip}(f_{D_n})} > 0, \tag{10}$$

where $\mathrm{Lip}(g)$ indicates the Lipschitz constant of $g$.

Proof: First note that for any $D > (\check{c}^{-1} \vee \hat{c})$, since $P$ is
supported on $C$ and $f_D$ agrees with $f$ on $C$, we have:

$$\int_{\mathbb{R}^+} f_D \, dP = \int_C f_D \, dP = \int_C f \, dP. \tag{11}$$

Then, using the fact that for every $D$, $f_D / \mathrm{Lip}(f_D)$ is
Lipschitz(1), we can invoke the dual representation (8) of
the Wasserstein distance to write:

$$K_n \sup_D \frac{1}{\mathrm{Lip}(f_D)} \left| \int_{\mathbb{R}^+} f_D \, d\hat{P}_n - \int_{\mathbb{R}^+} f_D \, dP \right| \xrightarrow{a.s.} 0. \tag{12}$$

By combining equations (11) and (12), it follows that for
any sequence $D_n \to \infty$, we have:

$$\frac{K_n}{\mathrm{Lip}(f_{D_n})} \left| \int_{\mathbb{R}^+} f_{D_n} \, d\hat{P}_n - \int_C f \, dP \right| \xrightarrow{a.s.} 0. \tag{13}$$

If furthermore $D_n$ is chosen such that equation (10) is
satisfied, then the factor $K_n / \mathrm{Lip}(f_{D_n})$ is eventually bounded away
from zero, and can be eliminated from equation (13) to lead
to the convergence of the estimator. ■
Of course, there may be more than one way in which one
could construct $\hat{P}_n$. In this paper, we focus on demonstrating
the validity and usefulness of the methodology by providing
two possible constructions. The results would remain valid
regardless of the specific construction, and other constructions
boasting more appealing properties, such as rates of convergence
under more lenient assumptions, are welcome future
contributions to this framework.

IV. The Good-Turing Pseudo-Empirical Measure 

A. Definitions and Properties 

The platform on which we build our estimation scheme
is the Good-Turing estimator, and in particular its strong
consistency established by WVK. In this section, we review
the main definitions and properties relevant to the rest of
the development. Let $B_{n,k}$ be the subset of symbols of $A_n$
that appear exactly $k$ times in the samples $X_{n,1}, \dots, X_{n,n}$.
The Good-Turing estimation problem, in reference to the
pioneering work of Good in [4], is the estimation of the
quantities $\gamma_{n,k} = p_n(B_{n,k})$, for each $k = 0, 1, \dots, n$, that is
the total probability of all symbols that appear exactly $k$ times.
We can group these with the notation $\gamma_n = \{\gamma_{n,k}\}_{k \in \mathbb{N}_0}$, which
we pad with zeros for $k > n$. In particular, Good suggests the
following estimator.

Definition 5. Let $\varphi_{n,k} = |B_{n,k}|$ be the number of symbols of
$A_n$ that appear $k$ times in $X_{n,1}, \dots, X_{n,n}$. The Good-Turing
estimator $\phi_n = \{\phi_{n,k}\}_{k \in \mathbb{N}_0}$ of $\gamma_n$, for each $k \in \mathbb{N}_0$, is

$$\phi_{n,k} = \frac{(k + 1) \, \varphi_{n,k+1}}{n}. \tag{14}$$
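
Equation (14) only requires the "counts of counts" of the sample. The following sketch computes $\phi_n$ from a list of observed symbols; the function name `good_turing` and the toy sample are illustrative assumptions.

```python
from collections import Counter


def good_turing(samples):
    """Good-Turing estimator phi_{n,k} = (k + 1) * varphi_{n,k+1} / n for k = 0..n-1."""
    n = len(samples)
    occurrences = Counter(samples)                   # symbol -> number of appearances
    count_of_counts = Counter(occurrences.values())  # k -> number of symbols seen k times
    return {k: (k + 1) * count_of_counts.get(k + 1, 0) / n for k in range(n)}


if __name__ == "__main__":
    phi = good_turing(list("abracadabra"))
    # phi[0] estimates the total probability of unseen symbols.
    print({k: v for k, v in phi.items() if v > 0})
    print(sum(phi.values()))
```

Since $\sum_k (k+1) \varphi_{n,k+1} = n$, the entries of $\phi_n$ are nonnegative and sum to one, so $\phi_n$ is itself a probability measure on $\mathbb{N}_0$.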

WVK establish a host of convergence properties for the
Good-Turing estimation problem and the Good-Turing estimator.
We group these in the following theorem.

Theorem 1. Define the Poisson $P$-mixture $\lambda = \{\lambda_k\}_{k \in \mathbb{N}_0}$ as,
for each $k \in \mathbb{N}_0$:

$$\lambda_k = \int_C \frac{x^k e^{-x}}{k!} \, dP(x). \tag{15}$$

We then have the following results that determine the limiting
behavior of $\gamma_n$, and the strong consistency of the Good-Turing
estimator $\phi_n$:

(i) We have that $\gamma_{n,k} \xrightarrow{a.s.} \lambda_k$ and $\phi_{n,k} \xrightarrow{a.s.} \lambda_k$, and
therefore $|\phi_{n,k} - \gamma_{n,k}| \xrightarrow{a.s.} 0$, pointwise for each $k \in \mathbb{N}_0$
as $n \to \infty$.

(ii) By Scheffé's theorem ([2], p. 215), it also follows that
these convergences hold in $L_1$ almost surely, in that
$\|\gamma_n - \lambda\|_1 \xrightarrow{a.s.} 0$ and $\|\phi_n - \lambda\|_1 \xrightarrow{a.s.} 0$, and therefore
$\|\phi_n - \gamma_n\|_1 \xrightarrow{a.s.} 0$, as $n \to \infty$.
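
For a discrete mixing distribution, the Poisson mixture in (15) is a finite weighted sum of Poisson probabilities. A minimal sketch, where `poisson_mixture` and the example mixing distribution are assumptions made for illustration:

```python
import math


def poisson_mixture(atoms, weights, k_max):
    """lambda_k = sum_i w_i * x_i**k * exp(-x_i) / k!, for k = 0..k_max."""
    return [
        sum(w * math.exp(-x) * x ** k / math.factorial(k) for x, w in zip(atoms, weights))
        for k in range(k_max + 1)
    ]


if __name__ == "__main__":
    # Hypothetical P with atoms 0.5 and 1.5 and weights 0.25 and 0.75.
    lam = poisson_mixture([0.5, 1.5], [0.25, 0.75], 30)
    print(lam[:4])
    print(sum(lam))  # close to 1, up to the truncation at k_max
```

The entry $\lambda_0 = \int_C e^{-x} \, dP(x)$ is the limiting total probability of the symbols that do not appear in the sample.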

B. Empirical Measure Analogy

The analogy that we would like to make in this section is the
following. Assuming $\lambda$ is given, one could take $n$ i.i.d. samples
from it, and form the empirical measure or the type, call it
$\hat{\lambda}_n = \{\hat{\lambda}_{n,k}\}_{k \in \mathbb{N}_0}$. Such an empirical measure would satisfy
well-known statistical properties, in particular the strong law
of large numbers would apply, and we would have $\hat{\lambda}_{n,k} \xrightarrow{a.s.}
\lambda_k$. By Scheffé's theorem, $L_1$ convergence would also follow.
It is evident from Theorem 1 that despite the fact that we
do not have such a true empirical measure, the Good-Turing
estimator $\phi_n$ behaves as one, and we may be justified to call
it a pseudo-empirical measure.

Now observe that since, for discrete distributions, the
total variation distance is related to the $L_1$ distance by
$\sup_{B \subset \mathbb{N}_0} |\hat{\lambda}_n(B) - \lambda(B)| = \frac{1}{2} \|\hat{\lambda}_n - \lambda\|_1$, the true empirical
measure also converges in total variation. As a special
case, the Glivenko-Cantelli theorem applies in that
$\sup_k |F(k; \lambda) - F(k; \hat{\lambda}_n)| \xrightarrow{a.s.} 0$. Recall that $F(\cdot; \cdot)$ denotes
the cumulative of the second argument (a measure) evaluated
at the first argument. In light of the above, this remains valid
for the pseudo-empirical measure. However, for the classical
empirical measure, we also have the rate of convergence in
the Glivenko-Cantelli theorem, in the form of the Kolmogorov-Smirnov
theorem and its variants for discrete distributions, see
for example [5]. Such results are often formulated in terms of a
convergence in probability of rate $n^{-1/2}$. So we next ask whether
such rates hold for the pseudo-empirical measure as well.

We first note that the rare-events source model is lenient,
in the sense that it does not impose any convergence rate on
$P_n \Rightarrow P$. Therefore, convergence results that aim to parallel
those of a true empirical measure will depend on assumptions
on the rate of this core convergence. In particular, let us assume
that we know something about the weak convergence rate of
$P_n$ to $P$ in terms of the Wasserstein distance, in that we
assume there exists an $r > 0$ such that

$$n^r \, d_W(P_n, P) \to 0.$$

For example, in Lemma 5 we will show that this holds true
for a class of rare-events sources suggested by WVK.

Next, note that Lemma 11 in WVK gives the following
useful concentration rate for the pseudo-empirical measure
around its mean.

Lemma 4. For any $\delta > 0$, $n^{1/2 - \delta} \|\phi_n - E[\phi_n]\|_1 \xrightarrow{a.s.} 0$.

In the following statement, we show that a Kolmogorov-Smirnov-type
convergence to $\lambda$ does hold for the pseudo-empirical
measure $\phi_n$, with a rate that is essentially the slower
of that of the concentration of Lemma 4 and that of the rare-events
source itself.

Theorem 2. Let $r > 0$ be such that $n^r \, d_W(P_n, P) \to 0$. Then
for any $\delta > 0$, we have:

$$n^{\min\{r, 1/2\} - \delta} \sup_k |F(k; \lambda) - F(k; \phi_n)| \xrightarrow{a.s.} 0. \tag{16}$$

Proof: For convenience, define $B_k = \{0, \dots, k\}$. The
proof requires three approximations. The first is to approximate
$\phi_n$ with $E[\phi_n]$. This is already achieved using Lemma
4. Since the $L_1$ distance is twice the total variation distance,
and specializing to the subsets $B_k$, we have that for all $\delta > 0$:

$$n^{1/2 - \delta} \sup_k |F(k; E[\phi_n]) - F(k; \phi_n)| \xrightarrow{a.s.} 0. \tag{17}$$

The next two approximations are (a) to approximate $E[\phi_n]$
with a Poisson $P_n$-mixture (using the theory of Poisson approximation),
and (b) to approximate the latter with $\lambda$, which
is a Poisson $P$-mixture (using the convergence in $d_W(P_n, P)$).

Part (a) - For convenience, let $\pi_n$ be a Poisson($x$) $P_n$-mixture,
and let $\eta_n$ be a Binomial($x/n$, $n$) $P_n$-mixture. One
can show, as in the proof of Lemma 7 of WVK, that $E[\phi_n]$
is a Binomial($x/n$, $n - 1$) $P_n$-mixture. We first relate $E[\phi_n]$ to
$\eta_n$, which is the natural candidate for Poisson approximation.
We then use Le Cam's theorem to relate $\eta_n$ to $\pi_n$.

We start with a general observation. Let $\mathcal{F} = \{f(\cdot; x) :
x \in C\}$ and $\mathcal{G} = \{g(\cdot; x) : x \in C\}$ be two parametric
classes of probability mass functions over $\mathbb{N}_0$, e.g. Poisson
and Binomial, and let $Q$ be a mixing distribution supported
on $C$. Say that for some subset $B \subset \mathbb{N}_0$, we have the pointwise
bound $|f(B; x) - g(B; x)| \le \epsilon(x)$. It follows that the mixture
of the bound is also a bound on the mixture. More precisely:

$$\left| \int_C f(B; x) \, dQ(x) - \int_C g(B; x) \, dQ(x) \right| \le \int_C \epsilon(x) \, dQ(x). \tag{18}$$

Note that if the pointwise bound above holds uniformly over
$B$, then the same is true for the mixture bound. We will use
this particularly with the subsets $B_k$, to bound the difference
of cumulative distribution functions.

Now let $g_n(k; x)$ be the c.d.f. of a Binomial($x/n$, $n$)
random variable, and let $\tilde{g}_n(k; x)$ be the c.d.f. of a
Binomial($x/n$, $n - 1$) random variable. For any given $k$, we
have the following:

$$\left(1 - \frac{x}{n}\right) \tilde{g}_n(k; x) = \sum_{m=0}^{k} \binom{n-1}{m} \left(\frac{x}{n}\right)^m \left(1 - \frac{x}{n}\right)^{n-m} = g_n(k; x) - \frac{1}{n} \sum_{m=0}^{k} m \binom{n}{m} \left(\frac{x}{n}\right)^m \left(1 - \frac{x}{n}\right)^{n-m}.$$

Using the facts that the sum is no larger than the mean and
that $\tilde{g}_n(k; x) \le 1$, it follows that for any given $k$ we have:

$$|g_n(k; x) - \tilde{g}_n(k; x)| = \left| \frac{1}{n} \sum_{m=0}^{k} m \binom{n}{m} \left(\frac{x}{n}\right)^m \left(1 - \frac{x}{n}\right)^{n-m} - \frac{x}{n} \, \tilde{g}_n(k; x) \right| \le \frac{x}{n}.$$

Note that $\int_C g_n(k; x) \, dP_n = F(k; \eta_n)$, the c.d.f. of $\eta_n$, and
$\int_C \tilde{g}_n(k; x) \, dP_n = F(k; E[\phi_n])$, the c.d.f. of $E[\phi_n]$. Using the
observation leading to equation (18), it follows that:

$$\sup_k |F(k; E[\phi_n]) - F(k; \eta_n)| \le \frac{1}{n} \int_C x \, dP_n(x) \le \frac{\hat{c}}{n}. \tag{19}$$

Using Le Cam's theorem (see, for example, [6]), we know
that the total variation distance, and hence the difference
of probabilities assigned to any subset $B \subset \mathbb{N}_0$ by a
Poisson($x$) distribution and a Binomial($x/n$, $n$) distribution, is
upper-bounded by $x^2/n$. We apply this to the subsets $B_k$, and use
the observation leading to equation (18) once again to extend
this result to the respective $P_n$-mixtures:

$$\sup_k |F(k; \pi_n) - F(k; \eta_n)| \le \frac{1}{n} \int_C x^2 \, dP_n(x) \le \frac{\hat{c}^2}{n}. \tag{20}$$

By combining equations (19) and (20), we deduce that for
all $\delta > 0$:

$$n^{1 - \delta} \sup_k |F(k; E[\phi_n]) - F(k; \pi_n)| \to 0. \tag{21}$$
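
The Le Cam bound used in Part (a) can be checked numerically for a single value of $x$, before any mixing. The sketch below compares the c.d.f.s of a Binomial($x/n$, $n$) and a Poisson($x$) random variable; the function names and the chosen values of $n$ and $x$ are illustrative assumptions.

```python
import math


def binomial_cdf(n, p, k):
    """P(Binomial(n trials, success probability p) <= k)."""
    return sum(math.comb(n, m) * p ** m * (1 - p) ** (n - m) for m in range(min(k, n) + 1))


def poisson_cdf(x, k):
    """P(Poisson(x) <= k), with the p.m.f. computed iteratively to avoid large factorials."""
    term = math.exp(-x)
    total = term
    for m in range(1, k + 1):
        term *= x / m
        total += term
    return total


def cdf_gap(n, x):
    """sup over k of |Binomial c.d.f. - Poisson c.d.f.|; the supremum is attained for k <= n."""
    return max(abs(binomial_cdf(n, x / n, k) - poisson_cdf(x, k)) for k in range(n + 1))


if __name__ == "__main__":
    x = 1.5
    for n in (50, 100):
        print(n, cdf_gap(n, x), x ** 2 / n)  # the gap stays below x^2 / n
```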



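Part (b) - Now let $h(k; x)$ be the c.d.f. of a Poisson($x$)
random variable. Observe that:

$$\frac{d}{dx} h(k; x) = \sum_{m=0}^{k} \frac{d}{dx} \frac{x^m e^{-x}}{m!} = \frac{1}{x} \sum_{m=0}^{k} m \, \frac{x^m e^{-x}}{m!} - h(k; x).$$

Both terms on the right-hand side lie in $[0, 1]$, the first because it
is at most $\frac{1}{x} E[\mathrm{Poisson}(x)] = 1$. It follows that
$\left| \frac{d}{dx} h(k; x) \right| \le 1$.

Therefore, when viewed as a function of $x$, $h(k; x)$ is
a Lipschitz(1) function on $C$ for all $k$. Using the dual
representation of the Wasserstein distance, we then have:

$$\sup_k |F(k; \pi_n) - F(k; \lambda)| = \sup_k \left| \int_C h(k; x) \, dP_n(x) - \int_C h(k; x) \, dP(x) \right| \le \sup_{h \in \mathrm{Lipschitz}(1)} \left| \int_C h \, dP_n - \int_C h \, dP \right| = d_W(P_n, P).$$

Using the assumption of the convergence rate of $P_n$ to $P$, it
follows that for all $\delta > 0$ we have:

$$n^{r - \delta} \sup_k |F(k; \pi_n) - F(k; \lambda)| \to 0. \tag{22}$$

The statement of the theorem follows by combining equations
(17), (21), and (22). ■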
In a practical situation, one would expect that the rare-events
source is well-behaved enough that $r > 1/2$, and that
the bottleneck of Theorem 2 is given by the $1/2$ rate, and
therefore we have a behavior that more closely parallels a
true empirical measure. Indeed, some natural constructions
obey this principle. Most trivially, for a sequence of uniform
sources, e.g. if $p_n(a) = 1/n$, we have $P_n = P$, and therefore
$r = \infty$. More generally, consider the following class of rare-events
sources suggested by WVK.

Definition 6. Let $g$ be a density on $[0, 1]$ that is continuous
Lebesgue almost everywhere, and such that $\check{c} \le g(w) \le \hat{c}$
for all $w \in [0, 1]$. Let $A_n = \{1, \dots, \lfloor \alpha n \rfloor\}$ for some $\alpha > 0$,
and for every $a \in A_n$ let $p_n(a) = \int_{(a-1)/\lfloor \alpha n \rfloor}^{a/\lfloor \alpha n \rfloor} g(w) \, dw$. One
can then verify that $\{(A_n, p_n)\}$ is indeed a rare-events source,
with $P$ being the law of $g(W)$, where $W \sim g$. We call such
a construction a rare-events source obtained by quantizing $g$.

Lemma 5. Let $g$ be a density as in Definition 6, and let
$\{(A_n, p_n)\}$ be a rare-events source obtained by quantizing $g$.
If $g$ has finitely many discontinuities, and is Lipschitz within
each interval of continuity, then for all $r < 1$:

$$n^r \, d_W(P_n, P) \to 0.$$

Proof: Without loss of generality, assume $\alpha = 1$, and
that the largest Lipschitz constant is 1. Consider the quantized
density on $[0, 1]$:

$$g_n(w) = n \int_{(\lceil wn \rceil - 1)/n}^{\lceil wn \rceil / n} g(v) \, dv,$$

where the integral is against the Lebesgue measure. Then it
follows that $P_n$ is the law of $g_n(W_n)$, where $W_n \sim g_n$.

Say $g$ has $L$ discontinuities, and let $D_n$ be the union of
the $L$ intervals of the form $[(a - 1)/n, a/n]$ which contain
these discontinuities. In all other intervals, we have that
$|g(w) - g_n(w)| \le 1/n$, using Lipschitz continuity and the
intermediate value theorem. It follows that

$$\int_{[0,1]} |g(w) - g_n(w)| \, dw = \int_{D_n} |g - g_n| \, dw + \int_{[0,1] \setminus D_n} |g - g_n| \, dw \le \frac{L}{n} + \frac{1}{n}.$$

For any particular $x \in C$, let $B_x = \{w \in [0, 1] : g(w) \le x\}$.
We then have

$$|F(x; P_n) - F(x; P)| = \left| \int_{B_x} \left( g(w) - g_n(w) \right) dw \right| \le \int_{[0,1]} |g(w) - g_n(w)| \, dw \le \frac{L + 1}{n}.$$

By integrating over all $x$:

$$d_W(P_n, P) = \int_C |F(x; P_n) - F(x; P)| \, dx \le \frac{(L + 1)(\hat{c} - \check{c})}{n}.$$

Therefore the lemma follows. ■
We end by remarking that the rare-events sources covered
by Lemma 5 are rather general in nature. For example, all of
the illustrative and numerical examples offered by WVK are
special cases (more precisely, they have piecewise-constant $g$).

V. Constructing $\hat{P}_n$ via Mixing Density Estimation

We would now like to address the task of using
$X_{n,1}, \dots, X_{n,n}$ to construct a sequence of probability measures
$\hat{P}_n$ that, for almost every sample sequence, converges
weakly to $P$, as outlined in Section III. Since we have
established the Good-Turing estimator as a pseudo-empirical
measure issued from a Poisson $P$-mixture, in both consistency
and rate, this is analogous to a mixture density estimation
problem, with the true empirical measure replaced with the
Good-Turing estimator $\phi_n$.

We start by noting that the task is reasonable, because the
mixing distribution in a Poisson mixture is identifiable from
the mixture itself. This observation can be traced back to
[7] and [8]. Then, the first natural approach is to use non-parametric
maximum likelihood estimation. In Section V-A
we use Simar's work in [9] to construct a valid estimator
in this framework. Unfortunately, to the best of the authors'
knowledge, the maximum likelihood estimator does not have
a well-studied rate of convergence on the recovered mixing
distribution. In Section V-B we consider instead a minimum
distance estimator, with which Chen gives optimal rates of
convergence in [10], albeit by assuming finite support for $P$.

A. Maximum Likelihood Estimator 

We first define the maximum likelihood estimator in our 
setting. Despite the fact that it is not, strictly speaking, 
maximizing a true likelihood, we keep this terminology in 
light of the origin of the construction. 

Definition 7. Given the pseudo-empirical measure (Good-Turing
estimator) $\phi_n$, the maximum likelihood estimator of the
mixing distribution is a probability measure $\hat{P}_n^{ML}$ on $\mathbb{R}^+$ which
maximizes the pseudo-likelihood as follows:

$$\hat{P}_n^{ML} \in \operatorname*{argmax}_Q \sum_{k=0}^{\infty} \phi_{n,k} \log \left( \int_{\mathbb{R}^+} \frac{x^k e^{-x}}{k!} \, dQ(x) \right). \tag{23}$$
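
To make the objective in (23) concrete, the sketch below evaluates the pseudo-log-likelihood of a candidate discrete mixing distribution $Q$. It only evaluates the objective; it is not Simar's algorithm and performs no maximization. The function name and the candidate distributions are assumptions made for illustration.

```python
import math


def pseudo_log_likelihood(phi, atoms, weights):
    """sum_k phi_k * log( sum_i w_i * x_i**k * exp(-x_i) / k! ) for a discrete Q.

    phi maps k to the pseudo-empirical mass phi_{n,k}; terms with zero mass are skipped.
    """
    total = 0.0
    for k, phi_k in phi.items():
        if phi_k == 0:
            continue
        mixture_k = sum(
            w * math.exp(-x) * x ** k / math.factorial(k) for x, w in zip(atoms, weights)
        )
        total += phi_k * math.log(mixture_k)
    return total


if __name__ == "__main__":
    # Idealized input: phi equal to a Poisson(2) p.m.f., truncated at k = 40.
    phi = {k: math.exp(-2.0) * 2.0 ** k / math.factorial(k) for k in range(41)}
    for x in (1.0, 2.0, 3.0):
        print(x, pseudo_log_likelihood(phi, [x], [1.0]))
```

In this idealized case the point mass at 2 scores highest among the three candidates, as expected from Gibbs' inequality.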



It is not immediately clear whether $\hat{P}_n^{ML}$ exists or is unique.
These questions were answered in the affirmative in [9]. On
close examination, it is clear that these properties do not
depend on whether we are using a pseudo-empirical measure
instead of a true empirical measure. Hence they remain valid in
our context. Next, we establish the main consistency statement.

Theorem 3. For almost every sample sequence, the sequence
$\{\hat{P}_n^{ML}\}$ converges weakly to $P$ as $n \to \infty$. We write this as

$$\hat{P}_n^{ML} \Rightarrow_{a.s.} P.$$

Proof: The main burden of proof is addressed by Theorem
1, in establishing the strong law of large numbers for the
pseudo-empirical measure, which is originally given in
WVK's Proposition 7. Indeed, in Simar's proof ([9], Section
3.3, pp. 1203-1204), we only use the fact that $\phi_{n,k} \xrightarrow{a.s.} \lambda_k$
for every $k \in \mathbb{N}_0$. The rest of the proof carries over, and the
current theorem follows. ■
It is worth noting that the consistency of the maximum
likelihood estimator does not even require condition (i)
in Definition 1 of the rare-events source to hold, since
Theorem 1 in fact holds without that condition. In that sense,
it is very general. However, when every neighborhood of $0$
or $\infty$ has positive probability under $P$, it limits the types
of functions that we can allow in the canonical problems,
including sequence probabilities and entropies as discussed in
WVK. When $P$ is not compactly supported, it is also difficult
to establish the rates of convergence.

B. Minimum Distance Estimator 

We now define a minimum distance estimator for our
setting. The reason that we suggest this alternate construction
of $\hat{P}_n$ is that it is useful to quantify the convergence rate to
$P$, and the minimum distance estimator provides such a rate.
However, it does so with the further assumption that $P$ has a
finite support, whose size is bounded by a known number $m$.

Also note that the definition of the estimator circumvents
questions of existence by allowing for a margin of $\epsilon$ from the
infimum, and does not necessarily call for uniqueness.

Definition 8. For a probability measure $Q$ on $\mathbb{R}^+$, let $\pi(Q)$ denote
the Poisson $Q$-mixture. Then, given the pseudo-empirical
measure $\phi_n$, a minimum distance estimator with precision $\epsilon$
is any probability measure $\hat{P}_n^{MD,m,\epsilon}$ on $\mathbb{R}^+$ that satisfies

$$\sup_k \left| F(k; \pi(\hat{P}_n^{MD,m,\epsilon})) - F(k; \phi_n) \right| \le \inf_Q \sup_k |F(k; \pi(Q)) - F(k; \phi_n)| + \epsilon,$$

where the infimum is taken over probability measures supported
on at most $m$ points on $\mathbb{R}^+$.
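
One crude way to obtain an approximate minimizer in the sense of Definition 8 is a brute-force search over a finite grid of candidate mixing distributions. The sketch below does this for $m = 2$. It is an illustrative assumption rather than the procedure of [10]: the grids, the function names, and the restriction to two atoms are all hypothetical choices, and the precision achieved depends on the grid resolution.

```python
import math
from itertools import accumulate, combinations_with_replacement


def poisson_mixture_pmf(atoms, weights, k_max):
    """P.m.f. of the Poisson mixture with a discrete mixing distribution, for k = 0..k_max."""
    return [
        sum(w * math.exp(-x) * x ** k / math.factorial(k) for x, w in zip(atoms, weights))
        for k in range(k_max + 1)
    ]


def ks_distance(pmf_a, pmf_b):
    """sup over k of the absolute difference between the two c.d.f.s."""
    return max(abs(a - b) for a, b in zip(accumulate(pmf_a), accumulate(pmf_b)))


def minimum_distance_two_point(phi, atom_grid, weight_grid):
    """Grid search over mixing distributions with at most two atoms.

    phi is a list with phi[k] the pseudo-empirical mass at k.
    Returns (distance, atoms, weights) for the best candidate found.
    """
    k_max = len(phi) - 1
    best = None
    for x1, x2 in combinations_with_replacement(atom_grid, 2):
        for w in weight_grid:
            atoms, weights = [x1, x2], [w, 1.0 - w]
            d = ks_distance(poisson_mixture_pmf(atoms, weights, k_max), phi)
            if best is None or d < best[0]:
                best = (d, atoms, weights)
    return best


if __name__ == "__main__":
    # Idealized input: phi equal to the exact mixture of a hypothetical two-atom P.
    phi = poisson_mixture_pmf([0.5, 1.5], [0.25, 0.75], 30)
    atom_grid = [0.25 * i for i in range(1, 13)]
    weight_grid = [i / 20 for i in range(21)]
    print(minimum_distance_two_point(phi, atom_grid, weight_grid))
```

With real data, `phi` would be the Good-Turing estimator, and the returned distance exceeds the infimum by an amount governed by the grid.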

We now provide the main consistency and rate results
associated with such estimators.

Theorem 4. Let $r > 0$ be such that $n^r \, d_W(P_n, P) \to 0$, and
assume that it is known that $P$ is supported on at most $m$
points. Let $\{\hat{P}_n^{MD,m,\epsilon_n}\}$ be a sequence of minimum distance
estimators chosen such that $\epsilon_n \le n^{-\min\{r, 1/2\}}$. Then as
$n \to \infty$, we have that for any $\delta > 0$:

$$n^{\min\{r/2, 1/4\} - \delta} \, d_W(\hat{P}_n^{MD,m,\epsilon_n}, P) \xrightarrow{a.s.} 0. \tag{24}$$

Remark: Since $d_W$ induces the weak convergence topology, it
also follows that $\hat{P}_n^{MD,m,\epsilon_n} \Rightarrow_{a.s.} P$.



Proof: To derive rate results in [10], Chen establishes 
a bound on the Wasserstein distance between mixing distri- 
butions, using the Kolmogorov-Smirnov distance between the 
c.d.f.s of the resulting mixtures. For this, he first introduces 
a notion of strong identifiability (Definition 2, p. 225), and 
shows that Poisson mixtures satisfy it (Section 4, p. 228). He 
then shows (in Lemma 2, p. 225) that if we have strongly 
identifiable mixtures and if two mixing distributions have a 
support of at most m points within a fixed compact set, such 
as $C$, then we can find a constant $M$ (which depends non-constructively on $m$ and $C$), such that for any two such mixing distributions $Q_1$ and $Q_2$, we have:

$$d_W(Q_1, Q_2)^2 \le M \sup_k \left| F\big(k;\pi(Q_1)\big) - F\big(k;\pi(Q_2)\big) \right|. \tag{25}$$

The main burden of proof therefore falls on our Theorem 2, in establishing a Kolmogorov-Smirnov-type convergence for the pseudo-empirical measure. The argument we present next is based on Chen's proof (Theorem 2, p. 226). By the triangle inequality, we have:

$$\sup_k \left| F\big(k;\pi(\hat{P}^{\mathrm{MD},m,\epsilon_n})\big) - F(k;\lambda) \right| \le \sup_k \left| F\big(k;\pi(\hat{P}^{\mathrm{MD},m,\epsilon_n})\big) - F(k;\phi_n) \right| + \sup_k \left| F(k;\lambda) - F(k;\phi_n) \right| \le 2 \sup_k \left| F(k;\lambda) - F(k;\phi_n) \right| + \epsilon_n,$$

where the final inequality is due to the definition of $\hat{P}^{\mathrm{MD},m,\epsilon_n}$. By Theorem 2 and by our choice of $\epsilon_n$, it follows that for all $\delta > 0$, we have:



$$n^{\min\{r,1/2\} - \delta} \sup_k \left| F\big(k;\pi(\hat{P}^{\mathrm{MD},m,\epsilon_n})\big) - F(k;\lambda) \right| \to_{\mathrm{a.s.}} 0. \tag{26}$$

By combining (25) and (26), the theorem follows. ■

Note that Chen's result can be used to show more. In 
particular, if we think of the true mixing distribution as 
residing in some neighborhood of a fixed distribution, then 
the convergence holds uniformly over that neighborhood. This 
may be interpreted as a form of robustness, but we do not dwell 
on it here. 

VI. Applications 

To solve canonical problems in the setting of Lemma 2,
when an a priori bound on $C$ is known or when $f$ is bounded
on $\mathbb{R}_+$, it suffices to construct a sequence of probability
measures $\hat{P}_n$ that weakly converges to $P$ for almost every
sample sequence. Since Theorem 3 provides such a sequence,
we need not go further than that.

However, to work within the more general setting of Lemma 3,
where no knowledge of $C$ is assumed and $f$ can be any
locally Lipschitz function, we can use the result of Theorem 4.
In this section, we start by illustrating this for some of
the quantities considered by WVK. We then suggest two new 
applications: alphabet size and support interval estimation. We 
conclude by remarking on some algorithmic considerations. 



A. Estimating Entropies and Probabilities 

First consider the entropy of the source $H(p_n)$, and the associated problem, in normalized form, of estimating $Y_n^H = H(p_n) - \log n$. One can then write:

$$Y_n^H = -\int_C \log x \, dP_n(x),$$

and therefore, by comparing to equation (1) with $f_n(x) = f(x) = -\log(x)$, we have a canonical estimation problem, and since $Y_n^H$ is deterministic, it is also strong. If we have a bound on $C$, we can use Lemma 2. Otherwise, note that on intervals of the form $[D^{-1}, D]$, $\log x$ is $D$-Lipschitz. Therefore if for some $s > 0$, $n^s d_W(\hat{P}_n, P) \to_{\mathrm{a.s.}} 0$, as given by Theorem 4 for example, then we can apply Lemma 3 using $D_n = n^s$. If $s$ exists but is unknown, we can still apply Lemma 3 using any sequence that is $o(n^s)$, such as $D_n = \epsilon \log n$, for some $\epsilon > 0$. The consistent estimator becomes:



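$$\hat{Y}_n^H = -\int \log_{D_n} x \, d\hat{P}_n(x), \tag{27}$$

where $\log_{D_n}$ denotes the $D_n$-tapered version of the logarithm.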



Next consider the probability of the sequence $p_n(X_{n,1}, \dots, X_{n,n})$, and the associated normalized problem of estimating $Y_n^P = \frac{1}{n} \log p_n(X_{n,1}, \dots, X_{n,n}) + \log n$. We have (WVK, Lemma 5):

$$\mathbb{E}[Y_n^P] = \mathbb{E}[\log p_n(X_n)] + \log n = \int_C \log x \, dP_n(x),$$

and therefore we also have a canonical estimation problem. Using McDiarmid's theorem, one can also show that (WVK, Lemma 6) $|\mathbb{E}[Y_n^P] - Y_n^P| \to_{\mathrm{a.s.}} 0$, and therefore we once again have a strong canonical estimation problem, and we can construct a consistent estimator as in the case of entropy. Referring to equation (27), we have $\hat{Y}_n^P = -\hat{Y}_n^H$.
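
The entropy estimator can be sketched numerically as a tapered plug-in integral against a finitely supported estimate of $P$. This is only a sketch: the helper names, the representation of the estimated measure as lists of atoms and weights, and the reading of "$D$-tapered" as evaluating the function after clamping $x$ to $[D^{-1}, D]$ are assumptions made here, not definitions quoted from the paper.

```python
import math


def taper(f, D):
    """D-tapered version of f: evaluate f after clamping x to [1/D, D].
    (Assumed reading of 'tapered'.)"""
    def f_D(x):
        return f(min(max(x, 1.0 / D), D))
    return f_D


def plug_in(f, atoms, weights, D):
    """Integral of the D-tapered f against the discrete measure
    sum_j weights[j] * delta_{atoms[j]}."""
    f_D = taper(f, D)
    return sum(w * f_D(x) for x, w in zip(atoms, weights))


def entropy_estimate(atoms, weights, n, D):
    """Undo the normalization: H(p_n) is estimated by log n + Y_H, where
    Y_H is minus the integral of the tapered log against the estimate."""
    y_h = -plug_in(math.log, atoms, weights, D)
    return math.log(n) + y_h


# Sanity check: a uniform source on n/c symbols has p_n(a) = c/n, so the
# distribution of n * p_n(X_n) is a point mass at c and H(p_n) = log(n/c).
n, c = 1000, 2.0
print(entropy_estimate([c], [1.0], n, D=10.0), math.log(n / c))
```

By the sign relation between the two estimators, the corresponding estimate for the normalized sequence probability is `plug_in(math.log, atoms, weights, D)`.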

B. Estimating the Alphabet Size 

Consider the size of the alphabet $|A_n|$. Since the model
describes large, asymptotically infinite, alphabets, we look at
the normalized problem of estimating $Y_n^A = |A_n|/n$. We have

(cf. my. 



$$Y_n^A = \frac{1}{n} \sum_{a \in A_n} 1 = \sum_{a \in A_n} p_n(a) \, \frac{1}{n p_n(a)} = \int_C \frac{1}{x} \, dP_n(x).$$



Once again, having a deterministic sequence of the form of (1) with $f_n(x) = f(x) = 1/x$, it follows that $\{Y_n^A\}_{n \in \mathbb{N}}$ is a strong canonical problem. If we have a bound on $C$, we can use Lemma 2. Otherwise, note that on intervals of the form $[D^{-1}, D]$, $1/x$ is $D^2$-Lipschitz. Therefore if for some $s > 0$, $n^s d_W(\hat{P}_n, P) \to_{\mathrm{a.s.}} 0$, as given by Theorem 4 for example, then we can apply Lemma 3 using $D_n = n^{s/2}$. As in Section VI-A, if $s$ exists but is unknown, we can still apply Lemma 3 using any sequence that is $o(n^s)$, such as $D_n = \epsilon \log n$, for some $\epsilon > 0$. The consistent estimator becomes:



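$$\hat{Y}_n^A = \int x_{D_n}^{-1} \, d\hat{P}_n(x), \tag{28}$$

where $x_{D_n}^{-1}$ denotes the $D_n$-tapered version of $1/x$.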
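
A small worked check of the alphabet size estimator, under assumptions that are not part of the paper: the estimated measure is represented by hypothetical lists of atoms and weights, tapering is taken to mean clamping $x$ to $[D^{-1}, D]$, and the two-level source is a toy example.

```python
def alphabet_size_estimate(atoms, weights, n, D):
    """n times the integral of the D-tapered 1/x against the discrete
    measure sum_j weights[j] * delta_{atoms[j]}."""
    def inv_tapered(x):
        # Assumed tapering: clamp x to [1/D, D] before inverting.
        return 1.0 / min(max(x, 1.0 / D), D)
    return n * sum(w * inv_tapered(x) for x, w in zip(atoms, weights))


# Two-level toy source: n/2 symbols of probability 1/n and n/8 symbols of
# probability 4/n. The measure of n * p_n(X_n) puts mass 1/2 on x = 1 and
# mass 1/2 on x = 4, and the true alphabet size is n/2 + n/8 = 5n/8.
n = 8000
print(alphabet_size_estimate([1.0, 4.0], [0.5, 0.5], n, D=10.0))  # 5000.0
```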



C. Estimating the Support Interval 

As discussed in Section II-B, estimating the support interval
is not a canonical problem per se. However, we show here that
we can extend the framework in a straightforward fashion to
provide consistent estimators of both $\check{c}$ and $\hat{c}$.

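Lemma 6. Let $\hat{P}_n \Rightarrow_{\mathrm{a.s.}} P$ such that for some $s > 0$, we have $n^s d_W(\hat{P}_n, P) \to_{\mathrm{a.s.}} 0$. This is particularly true under the conditions of Theorem 5. Given $q \in \mathbb{R}$ and $D > 1$, let $x_D^q$ denote the $D$-tapered version of $x^q$.

If $q_n = \log n / \log\log n$ and $D_n = n^{s/(2 q_n)}$, then we have: as $n \to \infty$,

$$\left( \int x_{D_n}^{-q_n} \, d\hat{P}_n(x) \right)^{-1/q_n} \to_{\mathrm{a.s.}} \check{c} \quad \text{and} \quad \left( \int x_{D_n}^{q_n} \, d\hat{P}_n(x) \right)^{1/q_n} \to_{\mathrm{a.s.}} \hat{c}.$$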

Proof: For conciseness, let us drop the argument of the probability measures, and write $dP$ for $dP(x)$. We provide the proof only for $\check{c}$, since the argument is analogous for $\hat{c}$. Recall that $\check{c}$ is the essential infimum of a random variable $Z \sim P$. Therefore, for any $D > (\check{c}^{-1} \vee \hat{c})$, we have:

$$\left( \int x_D^{-q} \, dP \right)^{-1/q} \to \check{c} \quad \text{as } q \to \infty. \tag{29}$$



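In the absence of a rate of convergence, we cannot simply plug in $\hat{P}_n$. But since we know that $n^s d_W(\hat{P}_n, P) \to_{\mathrm{a.s.}} 0$, we can use the dual representation of the Wasserstein distance and the fact that for every $q$ and $D$ the function $\frac{1}{q} D^{-1-q} x_D^{-q}$ is Lipschitz(1) over $\mathbb{R}_+$, to state: as $n \to \infty$,

$$n^s \sup_{q, D} \left| \int \tfrac{1}{q} D^{-1-q} x_D^{-q} \, d\hat{P}_n - \int \tfrac{1}{q} D^{-1-q} x_D^{-q} \, dP \right| \to_{\mathrm{a.s.}} 0. \tag{30}$$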



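We now want to relate this to the difference of the $q$th roots. Note that each of the integrals in (30) is bounded from below by $D^{-q}$. Using this and the fact that for any $a$ and $b > 0$ we have $|a^{1/q} - b^{1/q}| \le \frac{1}{q} (a \wedge b)^{1/q - 1} |a - b|$, we can write:

$$\left| \left( \int x_D^{-q} \, d\hat{P}_n \right)^{1/q} - \left( \int x_D^{-q} \, dP \right)^{1/q} \right| \le D^{2q} \left| \int \tfrac{1}{q} D^{-1-q} x_D^{-q} \, d\hat{P}_n - \int \tfrac{1}{q} D^{-1-q} x_D^{-q} \, dP \right|.$$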



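The choices $q_n = \log n / \log\log n$ and $D_n = n^{s/(2 q_n)}$ allow us to have $D_n^{2 q_n} = n^s$, and yet guarantee that as $n \to \infty$ both $q_n$ and $D_n \to \infty$. With this, we can use the convergence of equation (30), to state: as $n \to \infty$,

$$\left| \left( \int x_{D_n}^{-q_n} \, d\hat{P}_n \right)^{1/q_n} - \left( \int x_{D_n}^{-q_n} \, dP \right)^{1/q_n} \right| \le n^s \left| \int \tfrac{1}{q_n} D_n^{-1-q_n} x_{D_n}^{-q_n} \, d\hat{P}_n - \int \tfrac{1}{q_n} D_n^{-1-q_n} x_{D_n}^{-q_n} \, dP \right| \to_{\mathrm{a.s.}} 0. \tag{31}$$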



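We then combine (29) and (31), using the continuity of $t \mapsto 1/t$ on $(0, \infty)$ to pass from the $q_n$th roots to their reciprocals, to complete the proof.

Remarks. Note the following: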



(i) Other scaling schemes can be devised for $q_n$ and $D_n$,
as long as they both grow to $\infty$ as $n \to \infty$, yet $D_n^{2 q_n}$
remains at most $O(n^s)$.

(ii) If a bound $[D_{\min}, D_{\max}] \supset C$ is already known, then we
can taper $x^q$ accordingly, without growing $D_n$. In this
case, we can also speed up the rate of convergence by
choosing q n = § logn/ log 

(iii) If only an upper bound or only a lower bound is known,
we can taper $x^q$ accordingly, and only grow/shrink
the missing bound. In this case we leave $q_n = \log n / \log\log n$
as in the Lemma.

(iv) In the Lemma and the alternatives in these remarks, if $s$ is
unknown we can replace it wherever it appears (together
with constant factors) with a suitably decaying term, that
guarantees the behavior of remark (i). For example, in the
Lemma, we can choose $D_n = n^{1/(q_n \sqrt{\log\log n})}$, since then
$D_n^{2 q_n}$ becomes $o(n^s)$ for any $s$, and the proof applies.
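
The scalings of the Lemma and the two root-type estimators can be sketched as follows. The function names and the atoms/weights representation of the estimated measure are hypothetical, the rate exponent $s$ is assumed known, and tapering is again assumed to mean clamping $x$ to $[D^{-1}, D]$.

```python
import math


def lemma_scalings(n, s):
    """q_n = log n / log log n and D_n = n^(s / (2 q_n)); needs n >= 3 so
    that log log n is positive."""
    q = math.log(n) / math.log(math.log(n))
    D = n ** (s / (2.0 * q))
    return q, D


def support_interval_estimate(atoms, weights, n, s):
    """Lower and upper support estimates from the discrete measure
    sum_j weights[j] * delta_{atoms[j]}, via tapered q_n-th power means."""
    q, D = lemma_scalings(n, s)
    tapered = [min(max(x, 1.0 / D), D) for x in atoms]
    lower = sum(w * x ** (-q) for x, w in zip(tapered, weights)) ** (-1.0 / q)
    upper = sum(w * x ** q for x, w in zip(tapered, weights)) ** (1.0 / q)
    return lower, upper


n, s = 10**6, 0.5
q_n, D_n = lemma_scalings(n, s)
print(q_n, D_n, D_n ** (2 * q_n))  # the last value should be close to n**s
print(support_interval_estimate([1.0, 4.0], [0.5, 0.5], n, s))
```

With these choices $D_n = (\log n)^{s/2}$, which grows very slowly. In the toy call above the tapering window is still narrower than the true support $[1, 4]$, so the upper estimate is capped by $D_n$; this is one practical motivation for the alternative scalings in the remarks.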

D. Algorithmic Considerations 

One of the appealing properties of the maximum likelihood
estimator is that, by a result of Simar in [9], it is supported on
finitely many points. Simar also suggests a particular algorithm
for obtaining $\hat{P}^{\mathrm{ML}}$, the convergence of which was later
established in [11], with further improvements. One can also
solve for the MLE using the EM algorithm, as reviewed in
[12]. Penalized variants are also suggested, such as in [13]. The
literature on the non-parametric maximum likelihood estimator
for mixtures is indeed very rich. As for the minimum distance
estimator, in [10] Chen suggests variants of the work in [14],
where they use algorithms based on linear programming.
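
To make the EM option concrete, here is a minimal sketch of EM for the mixing weights of a Poisson mixture whose atoms are held fixed on a grid. This is a common simplification and is not Simar's algorithm nor any specific method from the cited references; the function names, the fixed grid, the iteration count, and the toy counts are all assumptions for illustration.

```python
import math


def poisson_pmf(k, lam):
    """Poisson(lam) probability of the count k, computed in log space."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def em_mixing_weights(counts, atoms, iterations=200):
    """EM iterations for the weights of a Poisson mixture with fixed atoms
    (only the weights are updated; the atoms stay on the given grid)."""
    m = len(atoms)
    weights = [1.0 / m] * m
    likelihood = [[poisson_pmf(k, x) for x in atoms] for k in counts]
    for _ in range(iterations):
        totals = [0.0] * m
        for row in likelihood:
            denom = sum(w * p for w, p in zip(weights, row))
            for j in range(m):
                # Posterior probability that this count came from atom j.
                totals[j] += weights[j] * row[j] / denom
        weights = [t / len(counts) for t in totals]
    return weights


# Toy data: small counts favor the atom at 1, large counts the atom at 8.
counts = [0, 1, 0, 2, 7, 9, 1, 8, 0, 10]
print(em_mixing_weights(counts, atoms=[1.0, 8.0]))
```

Fixing the atoms keeps every update in closed form; a finer grid trades computation for a closer approximation to the unrestricted maximum likelihood estimate.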